63 research outputs found
GPU-based Acceleration of Symbol Timing Recovery
This paper presents a novel implementation of graphics
processing unit (GPU) based symbol timing recovery using
polyphase interpolators to detect symbol timing error. Symbol
timing recovery is a compute intensive procedure that detects
and corrects the timing error in a coherent receiver. We
provide optimal sample-time timing recovery using a maximum
likelihood (ML) estimator to minimize the timing error.
This is an iterative and adaptive system that relies on
feedback; therefore, we present an accelerated implementation
design that uses a GPU for timing error detection (TED),
enabling fast error detection by exploiting the 2D filter structure
found in the polyphase interpolator. This hybrid, heterogeneous
CPU/GPU architecture computes a low-complexity, low-noise
matched filter (MF) while simultaneously performing TED. We then
compare the performance of CPU- and GPU-based timing recovery at
different interpolation rates, minimizing the timing error and
improving detection by up to a factor of 35. We further improve
the process by applying GPU optimizations and block processing to
increase throughput, all while maintaining the lowest possible
sampling rate.
Laboratory for Telecommunications Sciences; National Science Foundation (NSF)
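The combination of a polyphase filterbank with a maximum-likelihood timing error detector can be sketched as follows. This is an illustrative outline only, not the paper's implementation: the filterbank decomposition is the standard commutator form, and the TED uses the classic ML form e = y(k)·y'(k), where y comes from the matched filter and y' from its derivative filter; the function names and window sizes are assumptions.

```python
import numpy as np

def polyphase_bank(proto, n_phases):
    """Split a prototype interpolation filter into n_phases polyphase
    subfilters (row p holds the phase-p taps), zero-padding as needed."""
    taps = np.concatenate([proto, np.zeros((-len(proto)) % n_phases)])
    return taps.reshape(-1, n_phases).T

def ml_ted(sample_win, mf_phase, dmf_phase):
    """Maximum-likelihood timing error estimate e = y * y':
    y  = matched-filter output for the selected phase,
    y' = derivative-matched-filter output for the same phase."""
    y = np.dot(mf_phase, sample_win)
    dy = np.dot(dmf_phase, sample_win)
    return y * dy
```

The 2D structure the abstract exploits is visible here: each polyphase row is an independent dot product over the same sample window, which maps naturally onto parallel GPU threads.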
Dataflow-based Design and Implementation of Image Processing Applications
Dataflow is a well-known computational model that is widely used for
expressing the functionality of digital signal processing (DSP)
applications, such as audio and video data stream processing, digital
communications, and image processing. These applications usually
require real-time processing capabilities and have critical performance
constraints. Dataflow provides a formal mechanism for describing
specifications of DSP applications, imposes minimal data-dependency
constraints in specifications, and is effective in exposing and
exploiting task or data level parallelism for achieving high performance
implementations.
To demonstrate dataflow-based design methods in a manner that is
concrete and easily adapted to different platforms and back-end design
tools, we present in this report a number of case studies based on the
lightweight dataflow (LWDF) programming methodology. LWDF is designed as
a "minimalistic" approach for integrating coarse grain dataflow
programming structures into arbitrary simulation- or platform-oriented
languages, such as C, C++, CUDA, MATLAB, SystemC, Verilog, and VHDL. In
particular, LWDF requires minimal dependence on specialized tools or
libraries. This feature --- together with the rigorous adherence to
dataflow principles throughout the LWDF design framework --- allows
designers to integrate dataflow modeling approaches into existing
design methodologies and processes, and to experiment with them,
relatively quickly and flexibly.
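The coarse-grain dataflow structure that LWDF integrates into host languages can be illustrated with a minimal sketch. LWDF organizes each actor around an "enable" operation (test whether the actor can fire) and an "invoke" operation (perform one firing); the Python class and method names below are illustrative assumptions, since LWDF itself targets languages such as C, CUDA, and Verilog.

```python
from collections import deque

class Fifo:
    """Dataflow edge: a bounded FIFO buffer of tokens."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._q = deque()
    def population(self):
        return len(self._q)
    def write(self, token):
        self._q.append(token)
    def read(self):
        return self._q.popleft()

class ScaleActor:
    """Single-rate actor: each firing consumes one input token and
    produces one scaled output token."""
    def __init__(self, gain, inp, out):
        self.gain, self.inp, self.out = gain, inp, out
    def enable(self):
        # Fireable only when an input token exists and output space remains.
        return self.inp.population() >= 1 and self.out.population() < self.out.capacity
    def invoke(self):
        self.out.write(self.gain * self.inp.read())

def run(actors):
    """Minimal scheduler: keep firing any enabled actor until quiescence."""
    fired = True
    while fired:
        fired = False
        for a in actors:
            while a.enable():
                a.invoke()
                fired = True
```

Because firing conditions are checked only through `enable`, a scheduler needs no knowledge of actor internals, which is what makes the model easy to retarget across simulation and platform languages.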
Advances in Architectures and Tools for FPGAs and their Impact on the Design of Complex Systems for Particle Physics
The continual improvement of semiconductor technology has provided rapid advancements in device frequency and density. Designers of electronics systems for high-energy physics (HEP) have benefited from these advancements, transitioning many designs from fixed-function ASICs to more flexible FPGA-based platforms. Today’s FPGA devices provide significantly more resources than those available during the initial Large Hadron Collider design phase. To take advantage of the capabilities of future FPGAs in the next generation of HEP experiments, designers must not only anticipate further improvements in FPGA hardware, but must also adopt design tools and methodologies that can scale along with that hardware. In this paper, we outline the major trends in FPGA hardware, describe the design challenges these trends will present to developers of HEP electronics, and discuss a range of techniques that can be adopted to overcome these challenges.
Multiobjective Optimization for Reconfigurable Implementation of Medical Image Registration
In real-time signal processing, a single application often has multiple computationally intensive kernels that can benefit from acceleration using custom or reconfigurable hardware platforms, such as field-programmable gate arrays (FPGAs). For adaptive utilization of resources at run time, FPGAs with capabilities for dynamic reconfiguration are emerging. In this context, it is useful for designers to derive sets of efficient configurations that trade off application performance with fabric resources. Such sets can be maintained at run time so that the best available design tradeoff is used. Finding a single, optimized configuration is difficult, and generating a family of optimized configurations suitable for different run-time scenarios is even more challenging. We present a novel multiobjective wordlength optimization strategy developed through FPGA-based implementation of a representative computationally intensive image processing application: medical image registration. Tradeoffs between FPGA resources and implementation accuracy are explored, and Pareto-optimized wordlength configurations are systematically identified. We also compare search methods for finding Pareto-optimized design configurations and demonstrate the applicability of search based on evolutionary techniques for identifying superior multiobjective tradeoff curves. We demonstrate the feasibility of this approach in the context of FPGA-based medical image registration; however, it may be adapted to a wide range of signal processing applications.
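The Pareto-optimization idea at the core of this abstract can be sketched concretely: given candidate wordlength configurations scored on two objectives (e.g., FPGA resource cost and implementation error, both minimized), keep only the non-dominated set. The sample points below are hypothetical, not results from the paper, and the search shown is exhaustive rather than the evolutionary techniques the paper compares.

```python
def dominates(q, p):
    """q dominates p if q is no worse in both objectives (cost, error)
    and strictly better in at least one; both objectives are minimized."""
    return q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])

def pareto_front(points):
    """Return the non-dominated configurations: the set a run-time
    system could keep so the best available tradeoff is always on hand."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

A run-time reconfiguration manager would then pick, from this front, the configuration whose resource cost fits the currently available fabric.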
Using the DSPCAD Integrative Command-Line Environment: User's Guide for DICE Version 1.1
This document provides instructions on setting up, starting up, and
building DICE and its key companion packages, dicemin and dicelang. This
installation process is based on a general set of conventions, which we
refer to as the DICE organizational conventions, for software packages.
The DICE organizational conventions are specified in this report. These
conventions are applied to DICE, dicemin, and dicelang, as well as
to other software packages developed in the Maryland DSPCAD
Research Group.
The DSPCAD Lightweight Dataflow Environment: Introduction to LIDE Version 0.1
LIDE (the DSPCAD Lightweight Dataflow Environment) is a flexible,
lightweight design environment that allows designers to experiment with
dataflow-based approaches for design and implementation of digital
signal processing (DSP) systems.
LIDE contains libraries of dataflow graph elements (primitive actors,
hierarchical actors, and edges) and utilities that assist designers in
modeling, simulating, and implementing DSP systems using formal dataflow
techniques. The libraries of dataflow graph elements (mainly actors)
contained in LIDE provide useful building blocks that can be used to
construct signal processing applications, and that can be used as
examples that designers can adapt to create their own, customized LIDE
actors. Furthermore, by using LIDE along with the DSPCAD Integrative
Command Line Environment (DICE), designers can efficiently create and
execute unit tests for user-designed actors.
This report provides an introduction to LIDE. The report includes
details on the process for setting up the LIDE environment, and covers
methods for using pre-designed libraries of graph elements, as well as
creating user-designed libraries and associated utilities using the C
language. The report also gives an introduction to the C language
plug-in for dicelang. This plug-in, called dicelang-C, provides features
for efficient C-based project development and maintenance that are
useful when working with LIDE.
Graphics Processing Unit–Accelerated Nonrigid Registration of MR Images to CT Images During CT-Guided Percutaneous Liver Tumor Ablations
Rationale and Objectives: Accuracy and speed are essential for the intraprocedural nonrigid MR-to-CT image registration in the assessment of tumor margins during CT-guided liver tumor ablations. While both accuracy and speed can be improved by limiting the registration to a region of interest (ROI), manual contouring of the ROI prolongs the registration process substantially. To achieve accurate and fast registration without the use of an ROI, we combined a nonrigid registration technique based on volume subdivision with hardware acceleration using a graphics processing unit (GPU). We compared the registration accuracy and processing time of the GPU-accelerated volume subdivision-based nonrigid registration technique to the conventional nonrigid B-spline registration technique. Materials and Methods: Fourteen image data sets of preprocedural MR and intraprocedural CT images for percutaneous CT-guided liver tumor ablations were obtained. Each set of images was registered using the GPU-accelerated volume subdivision technique and the B-spline technique. Manual contouring of the ROI was used only for the B-spline technique. Registration accuracy (Dice Similarity Coefficient (DSC) and 95% Hausdorff Distance (HD)) and total processing time, including contouring of ROIs and computation, were compared using a paired Student’s t-test. Results: Accuracy of the GPU-accelerated registrations and B-spline registrations, respectively, was 88.3 ± 3.7% vs. 89.3 ± 4.9% (p = 0.41) for DSC and 13.1 ± 5.2 mm vs. 11.4 ± 6.3 mm (p = 0.15) for HD. Total processing time of the GPU-accelerated registration and B-spline registration techniques was 88 ± 14 s vs. 557 ± 116 s (p < 0.000000002), respectively; there was no significant difference in computation time despite the difference in the complexity of the algorithms (p = 0.71). Conclusion: The GPU-accelerated volume subdivision technique was as accurate as the B-spline technique and required significantly less processing time.
The GPU-accelerated volume subdivision technique may enable the integration of nonrigid registration into routine clinical practice.
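The accuracy metric reported above, the Dice Similarity Coefficient, has a simple closed form: DSC = 2|A ∩ B| / (|A| + |B|) for two binary segmentation masks A and B. A minimal sketch (assuming NumPy boolean masks; not code from the study) is:

```python
import numpy as np

def dice(a, b):
    """Dice Similarity Coefficient between two binary masks:
    2 * |A intersect B| / (|A| + |B|); 1.0 means perfect overlap.
    Two empty masks are treated as perfectly overlapping."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
```

In a registration study like this one, the masks would be a structure contour in the fixed CT image and the corresponding contour carried over from the registered MR image.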
NP-Click: A Programming Model for the Intel IXP1200
The architectural diversity and complexity of network processor architectures motivate the need for a more natural abstraction of the underlying hardware. In this paper, we describe a programming model, NP-Click, which makes it possible to write efficient code and improve application performance without having to understand all of the details of the target architecture. Using this programming model, we implement the data plane of an IPv4 router on a particular network processor, the Intel IXP1200, and compare results with a hand-coded implementation. Our results show the IPv4 router written in NP-Click performs within 7% of a hand-coded version of the same application using a realistic packet mix.